By the end of this training, you will be able to:

- explain how logistic regression differs from linear regression;
- fit a logistic regression in R using glm() with a binomial family;
- interpret the model output and visualize predicted probabilities.
Logistic regression is used to understand the association between a binary response variable and selected predictors. This type of analysis is common in ecological studies, where the binary response consists of the presence (1) or absence (0) of a given species, resulting in predictions of habitat suitability.
In a linear regression, we predict quantitative outcomes from the value of one or more predictors using a straight line. We can calculate correlation and test the significance of the regression. We can also compare different types of models, from simpler ones with a single predictor to more complex ones with several predictors and interactions, and then find the one that best fits our data. In a logistic regression, we can do all of that too. The main difference is that the outcomes are binary rather than continuous. Examples of binary outcomes are presence versus absence, as mentioned before, positive versus negative for a disease, or dead versus alive. These outcomes can be coded as 1 (success) and 0 (failure), as in the hypothetical data below:
# Create and visualize hypothetical data
set.seed(123)
df <- data.frame(predictor = runif(100, 0, 10))
df$outcome <- rbinom(100, 1, 1 / (1 + exp(-1 * (0.5 * df$predictor - 2.5))))
knitr::kable(head(df))
| predictor | outcome |
|---|---|
| 2.875775 | 0 |
| 7.883051 | 1 |
| 4.089769 | 0 |
| 8.830174 | 0 |
| 9.404673 | 1 |
| 0.455565 | 0 |
When we plot this type of binary data, we see that observations cluster around either of the two outcome possibilities.
This type of data is best fit by an S-shaped curve instead of a straight line, which is another difference between linear and logistic regressions.
The logistic curve represents the probability of a positive outcome as a function of a given predictor. As we move along the curve by changing the value of the predictor, the probability of the outcome goes from 0 to 100%. Since we are interested in classifying outcomes as success or failure, we establish a threshold along the logistic curve: above that threshold, outcomes are classified as positive (1), and below it, as negative (0).
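Both the S-shaped fit and the thresholding step can be sketched in base R on the hypothetical data above (df is recreated here so the chunk runs on its own; the 0.5 cutoff is a common default, not the only choice):

```r
# Recreate the hypothetical data from the chunk above
set.seed(123)
df <- data.frame(predictor = runif(100, 0, 10))
df$outcome <- rbinom(100, 1, 1 / (1 + exp(-1 * (0.5 * df$predictor - 2.5))))

# Fit the logistic regression and overlay the fitted S-shaped curve
fit <- glm(outcome ~ predictor, data = df, family = binomial)
plot(df$predictor, df$outcome, xlab = "Predictor", ylab = "Outcome (0/1)")
newdat <- data.frame(predictor = seq(0, 10, length.out = 200))
lines(newdat$predictor, predict(fit, newdata = newdat, type = "response"))

# Classify each observation using a 0.5 probability threshold
p <- predict(fit, type = "response")
predicted_class <- ifelse(p > 0.5, 1, 0)
table(observed = df$outcome, predicted = predicted_class)
```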
Before we can relate variables in our model, we transform the response variable to obtain a linear relationship between variables. This transformation is called the logit, and it consists of the log of the odds (the probability of success divided by the probability of failure), so that our response variable is expressed in log(odds). This is the equation for the logistic regression:
\[ \log\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + e \]

- log(p/(1-p)): response variable, in log(odds)
- b0: y-intercept
- x1: predictor variable
- b1: slope for explanatory variable 1
- e: error
If we had more explanatory variables, we could keep adding them to the equation above as b2*x2 and so forth.
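The logit transformation and its inverse are available in base R as qlogis() and plogis(); a quick check of the equivalence with the formulas above:

```r
# The logit maps probabilities in (0, 1) onto the whole real line
p <- c(0.1, 0.5, 0.9)
log_odds <- log(p / (1 - p))  # identical to qlogis(p)
log_odds

# The inverse logit maps log-odds back to probabilities
plogis(log_odds)              # identical to 1 / (1 + exp(-log_odds))
```

Note that a probability of 0.5 corresponds to a log-odds of exactly 0.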
Just like in a linear regression, we can use continuous and/or discrete variables to make predictions. Here are some examples:
Continuous variables

- Temperature variables such as minimum temperature of the coldest month and degree days to predict the habitat suitability of giant hogweed in North America (Cuddington et al., 2022)
- Nestling weight to predict probability of survival in great tits (Tinbergen & Boerlijst, 1990)
- Road density to predict habitat suitability in gray wolves (Mladenoff, Sickley, & Wydeven, 1999)

Discrete variables
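Discrete predictors enter the model as factors. As a hypothetical sketch (made-up habitat data, not taken from any of the studies above):

```r
# Hypothetical discrete predictor: habitat type
set.seed(42)
hab <- data.frame(
  habitat = factor(sample(c("forest", "wetland"), 80, replace = TRUE))
)
# Presence is made more likely in wetland in this invented example
hab$present <- rbinom(80, 1, ifelse(hab$habitat == "wetland", 0.7, 0.3))

# "forest", the first factor level, becomes the reference group
m <- glm(present ~ habitat, data = hab, family = binomial)
coef(m)  # "habitatwetland" is the log-odds difference relative to "forest"
```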
# Set the working directory to the folder containing the data file
setwd("C:/Users/user/Desktop/DFO_training")
library("readxl")
# xlsx files
ham <- read_excel("ind dates flat all parameters Hamilton only FOR Kim.xlsx")
analysis <- ham
knitr::kable(analysis[c(25:30), c(2, 8, 26, 47, 98)], row.names = FALSE,
             digits = 3, align = rep('c', 5),
             col.names = c("Location", "Season", "Total phosphorus",
                           "Filamentous diatom biomass", "Epilimnion temp."))
| Location | Season | Total phosphorus | Filamentous diatom biomass | Epilimnion temp. |
|---|---|---|---|---|
| deep | 2 | 0.008 | 0.000 | 21.788 |
| deep | 2 | 0.008 | 3.100 | 21.826 |
| NE | 1 | 0.020 | NA | 19.891 |
| deep | 1 | 0.020 | NA | 19.706 |
| NE | 1 | NA | NA | 13.897 |
| deep | 1 | NA | 287.182 | 16.049 |
# Remove rows with missing values in the variables of interest
analysis <- analysis[!is.na(analysis$filamentous_Diatom), ]
analysis <- analysis[!is.na(analysis$mean_mixing_depth_temp), ]
Ensure the reference group ("absent") is the first level shown:
analysis$filam_presence <- ifelse(analysis$filamentous_Diatom > 0, "present", "absent")
analysis$filam_presence <- factor(analysis$filam_presence)
levels(analysis$filam_presence)
## [1] "absent" "present"
# Keep only seasons 2 and 3 and exclude the "deep" area group
analysis <- subset(analysis, (season == 2 | season == 3) & area_group != "deep")
If using proportions as the response, provide the total number of trials (events where we got success + events where we got failure) through the "weights" argument.
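For example, a hypothetical data set where the response is a proportion of successes out of a known number of trials (all values made up for illustration):

```r
# Hypothetical proportion data: successes out of 20 trials per temperature
prop_dat <- data.frame(
  temp      = c(10, 14, 18, 22, 26),
  trials    = rep(20, 5),
  successes = c(3, 6, 11, 15, 18)
)
prop_dat$prop <- prop_dat$successes / prop_dat$trials

# The 'weights' argument supplies the number of trials behind each proportion
m_prop <- glm(prop ~ temp, data = prop_dat,
              family = binomial, weights = trials)
coef(m_prop)
```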
# Fit the logistic regression: presence/absence of filamentous diatoms
# as a function of mean mixing-depth temperature
model <- glm(filam_presence ~ mean_mixing_depth_temp,
data = analysis, family = binomial)
summary(model)
##
## Call:
## glm(formula = filam_presence ~ mean_mixing_depth_temp, family = binomial,
## data = analysis)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.1581 3.8402 2.645 0.00816 **
## mean_mixing_depth_temp -0.4202 0.1736 -2.420 0.01551 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 79.807 on 69 degrees of freedom
## Residual deviance: 72.195 on 68 degrees of freedom
## AIC: 76.195
##
## Number of Fisher Scoring iterations: 5
library(sjPlot)
# Plot the predicted probability of presence across the observed temperatures
plot_model(model, type = "pred", terms = c("mean_mixing_depth_temp[all]"))
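The coefficients reported by summary() are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. As a sketch on the hypothetical data from the beginning of the training (not the Hamilton data set):

```r
# Recreate the hypothetical data and model from the start of the training
set.seed(123)
df <- data.frame(predictor = runif(100, 0, 10))
df$outcome <- rbinom(100, 1, 1 / (1 + exp(-1 * (0.5 * df$predictor - 2.5))))
fit <- glm(outcome ~ predictor, data = df, family = binomial)

# Coefficients are on the log-odds scale; exponentiate for odds ratios
exp(coef(fit))

# 95% confidence intervals, also on the odds-ratio scale
exp(confint(fit))
```

An odds ratio above 1 for the predictor means the odds of success increase as the predictor increases.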